library(ggplot2)
library(dplyr)
library(ggmap)
library(tidyr)
library(cluster)
library(lubridate)
library(rgl)
library(maptools)
library(ggpubr)
library(viridis)
library(tidyverse)
library(scatterplot3d)
library(factoextra)
library(fpc)
library(NbClust)
library(reshape2)
library(scales)
library(plyr)
library(caret)
library(ROCR)
library(pROC)
library(rlist)
library(Hmisc)
library(corrplot)
library(ggcorrplot)
library(DMwR)
library(ggthemes)
library(e1071)
library(maps)
library(RColorBrewer)
model_dir = "models"
data_dir = "data"
map_dir = "map"
saved_maps = list.files(map_dir)
### Loading data
data=read.csv(paste(data_dir,"MCI_2014_to_2017.csv",sep="/"), header = TRUE, sep = ",")
### Loading map
for(file in saved_maps) {
load(paste(map_dir,file,sep="/"))
}
The second assignment for the York Machine Learning course, Machine Learning in Business Context, was to explore unsupervised machine learning algorithms, specifically clustering. We chose a dataset from the Toronto Police Service Public Safety Data Portal, MCI 2014 to 2017. This report follows the structure laid out in the CRISP-DM methodology.
The GitHub repository for all the source code is located at the following link: link here.
The RShiny app is located at the following link: link here.
The Toronto Police Service is the police force that serves the city of Toronto. It is the largest municipal police force in Canada and the third largest police force in the country overall. It is a taxpayer-funded service, ranking second among the City of Toronto’s budgetary expenses.
The objective of this model is to cluster crimes to determine which areas of Toronto have the highest levels of crime, overall and for each Major Crime Indicator. This would enable the Toronto Police to allocate their officers and specialists most effectively to the areas that need them most. The hope is that this would lower crime rates and enable more cases to be solved, all without spending more money.
There are some ethical implications to using crime data, as there are many avenues for bias to enter the dataset. Police services around North America have come under increased scrutiny in recent years for racist policing, and some police policies or laws inherently disadvantage certain groups of people, which creates bias in the data. Conclusions drawn from a machine learning model trained on biased data will themselves be biased. They should therefore be weighed against other data, such as demographics, and supplementary information, such as social context.
The data set was provided courtesy of the Toronto Police Service Open Data Portal. It is usable in accordance with the Open Government License - Ontario.
The data concerns all Major Crime Indicators (MCI) occurrences in Toronto by reported date, between 2014 and 2017. The MCIs are Assault, Break and Enter, Auto Theft, and Theft Over (excludes Sexual Assaults). Locations in the data set are approximate, as they have been deliberately offset to the nearest road intersection to protect the privacy of involved individuals.
There are 29 columns with 131,073 observations. However, 4 of these columns are duplicates, leaving 25 distinct columns. The attributes fall into three groups: date/time data, crime data, and location data.
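As a quick sanity check (a minimal sketch; the comparison assumes the X/Y columns mirror Long/Lat, which is why they are dropped during preprocessing later on):
dim(data)                              #131073 rows, 29 columns
all(data$X == data$Long, na.rm = TRUE) #X duplicates Long
all(data$Y == data$Lat, na.rm = TRUE)  #Y duplicates Lat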
The following table shows the date/time attributes:
| Attribute | Description |
|---|---|
| Occurrence Date | Date of occurrence |
| Occurrence Year | Year of occurrence |
| Occurrence Month | Month of occurrence |
| Occurrence Day | Day of occurrence |
| Occurrence Day of Year | Day of year of occurrence |
| Occurrence Day of Week | Day of week of occurrence |
| Occurrence Hour | Hour of occurrence |
| Reported Date | Date of report |
| Reported Year | Year of report |
| Reported Month | Month of report |
| Reported Day | Day of report |
| Reported Day of Year | Day of year of report |
| Reported Day of Week | Day of week of report |
| Reported Hour | Hour of report |
While this table shows the categorical attributes:
| Attribute | Description |
|---|---|
| Premise Type | The type of premise (outside, house, commercial, etc.) where the crime happened |
| Major Crime Indicator | The major crime indicator (type of crime) |
| Offence | The specific offence that happened |
| UCR Code | Uniform Crime Reporting Code - categorizes the crime |
| UCR Extension | Uniform Crime Reporting Extension - further categorizes the crime |
This table shows location attributes:
| Attribute | Description |
|---|---|
| Division | The Toronto Police Division |
| Neighbourhood | Toronto Neighbourhood |
| Hood ID | Toronto neighbourhood ID |
| Lat | Latitude |
| Long | Longitude |
Visualizing crime levels across all neighbourhoods.
shpfile <- paste(data_dir,"NEIGHBORHOODS_WGS84_2.shp",sep="/")
sh <- readShapePoly(shpfile)
## Warning: readShapePoly is deprecated; use rgdal::readOGR or sf::st_read
sh@data$AREA_S_CD <- as.integer(sh@data$AREA_S_CD)
total_offence_cnt_table = data %>% group_by(Hood_ID) %>% dplyr::summarise(offence_cnt = n())
hood_total_offence_cnt_table = merge(total_offence_cnt_table,sh@data,by.x='Hood_ID',by.y='AREA_S_CD')
points_offense_cnt <- fortify(sh, region = 'AREA_S_CD')
points_offense_cnt <- merge(points_offense_cnt, hood_total_offence_cnt_table, by.x='id', by.y='Hood_ID', all.x=TRUE)
torontoMap + geom_polygon(aes(x=long,y=lat, group=group, fill=offence_cnt), data=points_offense_cnt, color='black') +
scale_fill_distiller(palette='Spectral') + scale_alpha(range=c(0.5,0.5))
Visualizing a heatmap of Toronto crimes across neighbourhoods.
base_size <- 9
heat_group <- group_by(data, Neighbourhood, offence)
heat_count <- dplyr::summarise(heat_group,Total = n())
heat_count$Neighbourhood <- with(heat_count,reorder(Neighbourhood,Total))
heat_count.m <- melt(heat_count)
## Using Neighbourhood, offence as id variables
heat_count.m <- ddply(heat_count.m, .(variable), transform,rescale = rescale(value))
ggplot(heat_count.m, aes(Neighbourhood, offence)) +
geom_tile(aes(fill = rescale),colour = "white") +
ggtitle("Toronto Criminal Heatmap") +
scale_fill_gradient(low = "lightblue",high = "darkblue") +
theme_grey(base_size = base_size) +
labs(x = "", y = "") +
scale_x_discrete(expand = c(0, 0)) +
scale_y_discrete(expand = c(0, 0)) +
theme_minimal() +
theme(legend.position = "none",axis.ticks = element_blank(), axis.text.x = element_text(size = base_size *0.5, angle = 270, hjust = 0, colour = "grey50"),axis.text.y = element_text(size = base_size *0.5))
We can also visualize other interesting information from the dataset, e.g. finding the most dangerous or safest zones and neighbourhoods in Toronto, based on all MCIs or on a specific MCI. The following charts display the top 5 most dangerous and top 5 safest neighbourhoods in Toronto by total MCI count. For other top neighbourhoods/divisions, by total MCI or for a specific MCI, please explore our application.
location_group <- group_by(data, Neighbourhood)
crime_by_location <- dplyr::summarise(location_group, n=n())
crime_dangerous <- crime_by_location[order(crime_by_location$n, decreasing = TRUE), ]
crime_dangerous_top <- head(crime_dangerous, 5)
ggplot(aes(x = reorder(Neighbourhood, n), y = n, fill = reorder(Neighbourhood, n)), data = crime_dangerous_top) +
geom_bar(stat = 'identity', width = 0.6) +
ggtitle("Top 5 dangerous neighbourhoods in Toronto") +
coord_flip() +
xlab('Zone') +
ylab('Number of Occurrences') +
scale_fill_brewer(palette = "Reds") +
theme(plot.title = element_text(size = 16),
axis.title = element_text(size = 12, face = "bold"),
legend.position="none")
crime_safe <- crime_by_location[order(crime_by_location$n, decreasing = FALSE), ]
crime_safe_top <- head(crime_safe, 5)
ggplot(aes(x = reorder(Neighbourhood, -n), y = n, fill = reorder(Neighbourhood, -n)), data = crime_safe_top) +
geom_bar(stat = 'identity', width = 0.6) +
ggtitle("Top 5 safe neighbourhoods in Toronto") +
coord_flip() +
xlab('Zone') +
ylab('Number of Occurrences') +
scale_fill_brewer(palette = "Greens") +
theme(plot.title = element_text(size = 16),
axis.title = element_text(size = 12, face = "bold"),
legend.position="none")
It appears that some crimes were reported more than 40 years after they occurred. Since nothing stops people from reporting old crimes, we do not treat these records as outliers.
data$occurrencedate <- ymd(gsub("(.*)T.*", "\\1", data$occurrencedate))
data$reporteddate <- ymd(gsub("(.*)T.*", "\\1", data$reporteddate))
data[which(data$occurrencedate < as.Date("1970-01-01")),]
## X Y Index_ event_unique_id occurrencedate
## 1257 -79.53082 43.64304 147 GO-2015636294 1966-06-09
## 11889 -79.44592 43.68811 5156 GO-20143212768 1968-01-01
## reporteddate premisetype ucr_code ucr_ext offence reportedyear
## 1257 2015-04-17 Apartment 1430 100 Assault 2015
## 11889 2014-10-31 House 1430 100 Assault 2014
## reportedmonth reportedday reporteddayofyear reporteddayofweek
## 1257 April 17 107 Friday
## 11889 October 31 304 Friday
## reportedhour occurrenceyear occurrencemonth occurrenceday
## 1257 15 NA NA
## 11889 11 NA NA
## occurrencedayofyear occurrencedayofweek occurrencehour MCI
## 1257 NA 0 Assault
## 11889 NA 9 Assault
## Division Hood_ID Neighbourhood Lat Long
## 1257 D22 14 Islington-City Centre West (14) 43.64304 -79.53082
## 11889 D13 107 Oakwood Village (107) 43.68811 -79.44592
## FID
## 1257 257
## 11889 5889
The rest of the data is mostly categorical in nature, so there are no further outlier concerns to address.
The following code documents the preprocessing steps identified for this dataset.
#reload the data
data <- read.csv(paste(data_dir,"MCI_2014_to_2017.csv",sep="/"), header = TRUE, sep = ",")
First, we remove duplicate data, both rows and columns. Some events have duplicated event IDs and should be removed, and the X/Y columns duplicate Lat/Long. We don't use the UCR codes or the ID numbers, so those are removed as well.
#remove duplicate rows/entries
data <- subset(data, !duplicated(data$event_unique_id))
#remove columns that aren't used/duplicates
data <- data[, !colnames(data) %in% c("ï..X","Y","Index_","event_unique_id","ucr_code","ucr_ext","FID")]
Next, we format the dates. There are garbage time values at the end of the dates, which are removed. The other date values are also changed into appropriate date/time values. Whitespace is also present in the day of week columns, so that is trimmed.
#formatting dates - remove garbage time values at the end
data$occurrencedate <- gsub("(.*)T.*", "\\1", data$occurrencedate)
data$reporteddate <- gsub("(.*)T.*", "\\1", data$reporteddate)
data$occurrencetime = ymd_h(paste(data$occurrencedate,data$occurrencehour,sep = " "), tz="EST")
data$reportedtime = ymd_h(paste(data$reporteddate,data$reportedhour,sep = " "), tz="EST")
data$occurrencedate = ymd(data$occurrencedate)
data$reporteddate = ymd(data$reporteddate)
#removing whitespace from day of week
data$occurrencedayofweek <- as.factor(trimws(data$occurrencedayofweek, "b"))
data$reporteddayofweek <- as.factor(trimws(data$reporteddayofweek, "b"))
We also convert the month/day of week from string representation to integer representation:
data$reportedmonth = month(data$reportedtime)
data$reporteddayofweek = wday(data$reportedtime)
data$occurrencemonth = month(data$occurrencetime)
data$occurrencedayofweek = wday(data$occurrencetime)
Now let’s take a look at the missing data:
#missing data
colSums(is.na(data))
## X occurrencedate reporteddate
## 0 0 0
## premisetype offence reportedyear
## 0 0 0
## reportedmonth reportedday reporteddayofyear
## 0 0 0
## reporteddayofweek reportedhour occurrenceyear
## 0 0 32
## occurrencemonth occurrenceday occurrencedayofyear
## 0 32 32
## occurrencedayofweek occurrencehour MCI
## 0 0 0
## Division Hood_ID Neighbourhood
## 0 0 0
## Lat Long occurrencetime
## 0 0 0
## reportedtime
## 0
#find the rows that contain at least one NA
NAdata <- unique(unlist(lapply(data, function(x) which(is.na(x)))))
Rows with NA data:
NAdata
## [1] 921 922 923 964 986 1036 1037 1086 1087 1088 1157
## [12] 1158 1159 1211 1228 1508 1509 1510 1689 1690 1691 1692
## [23] 3330 3331 10372 10373 13858 13859 13860 13861 95190 98356
Occurrence dates for rows with NA data:
data$occurrencedate[NAdata]
## [1] "1989-01-01" "1998-01-01" "1991-01-01" "1998-01-01" "1999-10-01"
## [6] "1998-01-01" "1996-01-01" "1966-06-09" "1980-04-24" "1970-01-01"
## [11] "1999-01-01" "1996-01-01" "1992-12-21" "1996-01-31" "1998-06-01"
## [16] "1974-01-01" "1995-01-01" "1990-01-01" "1988-01-01" "1988-01-01"
## [21] "1973-08-31" "1999-03-24" "1999-01-01" "1982-03-13" "1968-01-01"
## [26] "1995-01-01" "1994-01-01" "1989-01-01" "1987-01-01" "1989-11-20"
## [31] "1996-01-01" "1978-04-10"
We can see that there are 32 missing values, all in columns derived from the occurrence date, and upon further inspection they all belong to the same 32 rows. Since the occurrence date itself is intact in these rows, the derived date columns can have their missing values imputed from it.
#imputing occurrence date components from the occurrence date field
data$occurrenceyear[NAdata] <- year(data$occurrencedate[NAdata])
data$occurrencemonth[NAdata] <- month(data$occurrencedate[NAdata], label = TRUE, abbr = FALSE)
data$occurrenceday[NAdata] <- day(data$occurrencedate[NAdata])
data$occurrencedayofweek[NAdata] <- wday(data$occurrencedate[NAdata], label = TRUE, abbr = FALSE)
data$occurrencedayofyear[NAdata] <- yday(data$occurrencedate[NAdata])
Now we replace the space in the strings with an underscore for easier processing:
#replace space in string
data$offence <- gsub("\\s", "_", data$offence)
data$MCI <- gsub("\\s", "_", data$MCI)
Next, all columns are converted into factors except Lat, Long, reportedtime, and occurrencetime. Unused factor levels (left over from the rows removed earlier) are also dropped.
#change things to factors
for(col in c("offence","MCI","Division","Hood_ID")) {
data[,col] = as.factor(data[,col])
}
#if we used the Gower distance and the daisy() function, the following columns could be converted to ordered factors; but since the Gower distance turned out to be too expensive for a large dataset, we treat them as unordered factors as well
for(col in c("reportedyear","reportedmonth","reportedday","reporteddayofyear","reporteddayofweek",
"reportedhour","occurrenceyear","occurrencemonth","occurrenceday","occurrencedayofyear",
"occurrencedayofweek","occurrencehour")) {
data[,col] = factor(data[,col])
}
#drop unused factor levels
for(col in names(data)) {
if(is.factor(data[,col])) {
data[,col] = droplevels(data[,col])
}
}
Finally, we check for missing data one last time:
# Check missing values one last time
colSums(is.na(data))
## X occurrencedate reporteddate
## 0 0 0
## premisetype offence reportedyear
## 0 0 0
## reportedmonth reportedday reporteddayofyear
## 0 0 0
## reporteddayofweek reportedhour occurrenceyear
## 0 0 0
## occurrencemonth occurrenceday occurrencedayofyear
## 0 0 0
## occurrencedayofweek occurrencehour MCI
## 0 0 0
## Division Hood_ID Neighbourhood
## 0 0 0
## Lat Long occurrencetime
## 0 0 0
## reportedtime
## 0
With the data prepared, we can now start looking at models.
First, we can look at clustering crime by neighbourhood. We need to coerce the data into a clusterable table, aggregated by MCI and neighbourhood.
# Neighbourhood first
#first, coerce the data into a table that can be clustered - we aren't interested in the occurrence date at this point
#courtesy of Susan Li - https://datascienceplus.com/exploring-clustering-and-mapping-torontos-crimes/
bygroup <- group_by(data, MCI, Hood_ID)
groups <- dplyr::summarise(bygroup, n=n())
groups <- groups[c("Hood_ID", "MCI", "n")]
hood <- as.data.frame(spread(groups, key=MCI, value=n))
hood_id = as.integer(hood[,"Hood_ID"])
hood = hood[,-1]
head(hood)
## Assault Auto_Theft Break_and_Enter Robbery Theft_Over
## 1 925 1091 501 243 187
## 2 867 203 132 302 14
## 3 203 59 84 73 10
## 4 233 74 61 57 4
## 5 206 53 56 66 4
## 6 389 175 144 73 19
Then we can use this table to perform k-means clustering. First we normalize the data, then we determine the number of clusters by plotting the within-cluster sum of squares against the number of clusters and picking the elbow of the curve.
hood <- scale(hood)
#determine number of clusters
wssplot <- function(data, nc=15, seed=1234) {
wss <- (nrow(data)-1)*sum(apply(data,2,var))
for (i in 2:nc) {
set.seed(seed)
wss[i] <- sum(kmeans(data, centers=i)$withinss)
}
plot(1:nc, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares")
}
#we can see there's an elbow around 3 clusters
wssplot(hood, nc=15)
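A complementary check could use the NbClust package (loaded above), which aggregates a large set of cluster-validity indices and reports the number of clusters preferred by the majority of them. A sketch, not part of the original analysis:
#majority vote across many cluster-validity indices (can take a while)
nb <- NbClust(hood, distance = "euclidean", min.nc = 2, max.nc = 10, method = "kmeans")
fviz_nbclust(nb) #factoextra visualization of the index votes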
It seems that 3 is a good choice for the number of clusters, so we proceed with k-means clustering on the dataset. We followed this tutorial (https://www.datanovia.com/en/lessons/cluster-validation-statistics-must-know-methods/) to evaluate the quality of the clustering results, using silhouette width as a quantitative measure of clustering quality. As the silhouette plot will show, clusters #2 and #3 both contain points whose silhouette width falls below zero, indicating a less-than-ideal clustering. This can also be observed in the first plot, where clusters #2 and #3 sit quite close to each other.
# K-means clustering
km.res <- eclust(hood, "kmeans", k = 3, graph = T)
fviz_silhouette(km.res, palette = "jco", ggtheme = theme_classic())
## cluster size ave.sil.width
## 1 1 10 0.01
## 2 2 37 0.21
## 3 3 93 0.59
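For reference, the silhouette width of a point compares its average distance to members of its own cluster with its average distance to the nearest other cluster; values near 1 indicate a good fit, and negative values suggest misassignment. The same numbers can be recomputed directly with the cluster package (a sketch; hood is the scaled matrix from above):
#silhouette widths recomputed from the cluster assignment and distances
sil <- silhouette(km.res$cluster, dist(hood))
summary(sil)$clus.avg.widths #per-cluster averages, matching the table above
mean(sil[, "sil_width"])     #overall average silhouette width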
Let’s look at the cluster means from the outputs to see what we can learn.
# k-means results
km.res
## K-means clustering with 3 clusters of sizes 10, 37, 93
##
## Cluster means:
## Assault Auto_Theft Break_and_Enter Robbery Theft_Over
## 1 2.5319589 1.6064456 2.5015651 2.3608452 2.9465633
## 2 0.5229260 0.3375853 0.6012997 0.5174595 0.2914680
## 3 -0.4802995 -0.3070442 -0.5082122 -0.4597253 -0.4327952
##
## Clustering vector:
## [1] 1 2 3 3 3 3 3 3 3 3 3 3 3 1 3 3 2 3 3 3 2 3 3 2 2 2 1 3 3 3 2 3 3 3 3
## [36] 3 3 3 2 2 3 3 3 3 2 3 2 3 3 2 2 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 2
## [71] 3 3 1 2 1 1 1 1 2 3 3 2 3 3 3 2 3 3 3 3 3 3 2 3 1 3 3 2 3 3 3 3 3 3 3
## [106] 3 3 3 3 3 2 3 2 3 3 3 2 3 2 2 2 2 3 2 3 2 2 2 3 2 2 2 3 3 3 2 1 2 3 3
##
## Within cluster sum of squares by cluster:
## [1] 153.40984 68.48510 45.97407
## (between_SS / total_SS = 61.5 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault" "clust_plot" "silinfo" "nbclust"
## [13] "data"
We can see that cluster 3 has lower than average crime. It contains 93 neighbourhoods, meaning the majority of Toronto is quite safe in terms of the number of incidents occurring. Cluster 2 has slightly above-average crime counts and only 37 neighbourhoods. The remaining 10 neighbourhoods, in cluster 1, have much higher than average crime counts. In other words, crime is concentrated in a small number of pockets of Toronto.
As an interesting exercise, we also look at a 3D representation of the clustering results. With three principal components, more variation is captured than with two, and the added dimension reveals more spread than is apparent in two dimensions.
#3D plot
pc <-princomp(hood, cor=TRUE, scores=TRUE)
plot3d(pc$scores[,1:3], col=km.res$cluster, main="k-means clusters")
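Since rgl opens an interactive device that won't render in a static report, a non-interactive alternative using scatterplot3d (loaded above) might look like this sketch:
#static 3D scatter of the first three principal components
scatterplot3d(pc$scores[,1:3], color = km.res$cluster, pch = 19, main = "k-means clusters (3 principal components)")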
Next, we can look at the neighbourhoods with a hierarchical clustering approach. Again, we need to determine how many clusters to use. For hierarchical clustering, we look at how the observations are distributed across clusters under different configurations (in this case, different numbers of clusters).
counts <- sapply(2:6, function(ncl) eclust(hood, "hclust", k = ncl, hc_metric = "euclidean", hc_method = "ward.D2")$size)
names(counts) <- 2:6
counts
## $`2`
## [1] 10 130
##
## $`3`
## [1] 10 44 86
##
## $`4`
## [1] 1 44 86 9
##
## $`5`
## [1] 1 44 86 7 2
##
## $`6`
## [1] 1 15 86 7 29 2
Again, we can see that 3 clusters is a reasonable choice if we consider only cluster size as the criterion; we don't want configurations that split off clusters containing just a handful of neighbourhoods.
Now that we have chosen the number of clusters, we proceed with the clustering itself. As with k-means, we use silhouette width as a quantitative measure of clustering quality, and we plot dendrograms to inspect the clusters visually.
# Hierarchical clustering
hc.res <- eclust(hood, "hclust", k = 3, hc_metric = "euclidean",
hc_method = "ward.D2")
# Visualize dendrograms
fviz_dend(hc.res, show_labels = FALSE,
palette = "jco", as.ggplot = TRUE)
fviz_silhouette(hc.res, palette = "jco", ggtheme = theme_classic())
## cluster size ave.sil.width
## 1 1 10 0.04
## 2 2 44 0.16
## 3 3 86 0.61
It seems we reach the same conclusion as with k-means: clusters #1 and #2 are not separated by a large gap. The following code plots the clustering results on the map. We also plot the offence count in each neighbourhood; as one would expect, the second plot is highly correlated with the first.
cluster_ids <- km.res$cluster
hood_ids_and_cluster_ids <- data.frame(cbind(hood_id,cluster_ids))
hood_ids_and_cluster_ids$cluster_ids = as.factor(hood_ids_and_cluster_ids$cluster_ids)
shpfile <- paste(data_dir,"NEIGHBORHOODS_WGS84_2.shp",sep="/")
sh <- readShapePoly(shpfile)
## Warning: readShapePoly is deprecated; use rgdal::readOGR or sf::st_read
sh@data$AREA_S_CD <- as.integer(sh@data$AREA_S_CD)
hood_name_and_cluster_ids = merge(hood_ids_and_cluster_ids,sh@data,by.x='hood_id',by.y='AREA_S_CD')
points_clustering <- fortify(sh, region = 'AREA_S_CD')
points_clustering <- merge(points_clustering, hood_name_and_cluster_ids, by.x='id', by.y='hood_id', all.x=TRUE)
p1 = torontoMap + geom_polygon(aes(x=long,y=lat, group=group, fill=cluster_ids), data=points_clustering, color='black') +
scale_fill_brewer(palette = "Set2")
p2 = torontoMap + geom_polygon(aes(x=long,y=lat, group=group, fill=offence_cnt), data=points_offense_cnt, color='black') +
scale_fill_distiller(palette='Spectral') + scale_alpha(range=c(0.5,0.5))
p3 = ggarrange(p1, p2, ncol = 2, nrow = 1, common.legend = F)
print(p3)
We can also cluster directly on latitude/longitude to find natural crime hotspots, and compare the result to a manual partition such as the Toronto Police divisions.
latlong <- data[, colnames(data) %in% c("Lat", "Long")]
# k-means
# 34 divisions in Toronto
k.means.fit <- kmeans(latlong, 34)
torclus <- as.data.frame(k.means.fit$centers)
torclus$size <- k.means.fit$size
latlongclus <- latlong
latlongclus$cluster <- as.factor(k.means.fit$cluster)
tormap <- get_map(location =c(left=-79.8129, bottom=43.4544, right=-78.9011, top=43.9132))
## Warning: bounding box given to google - spatial extent only approximate.
## converting bounding box to center/zoom specification. (experimental)
## Source : https://maps.googleapis.com/maps/api/staticmap?center=43.6838,-79.357&zoom=10&size=640x640&scale=2&maptype=terrain&language=en-EN
#plot each incident, color-coded by cluster
ggmap(tormap) +
geom_point(data = latlongclus, aes(x = Long, y = Lat, color = cluster)) +
theme_void() + coord_map()
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
#plot bubble map with cluster centroids, size/ color determined by the number of incidents in each cluster
ggmap(tormap) +
geom_point(data = torclus, aes(x = Long, y = Lat, size = size)) +
theme_void() + coord_map()
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
We can see how the clusters look with both the incidents plotted as well as the centroids. Next, we can perform k-means clustering on the natural crime hotspot clusters and on the Toronto Police divisions to see how they compare to one another.
#coerce data for Hotspots
data2 <- data
data2$cluster <- k.means.fit$cluster
bygroup <- group_by(data2, MCI, cluster)
groups <- dplyr::summarise(bygroup, n=n())
groups <- groups[c("cluster", "MCI", "n")]
hotspot <- data.frame(spread(groups, key=MCI, value=n))
hotspot <- hotspot[, -1]
hotspot = data.frame(scale(hotspot))
#determine number of clusters
#we can see there's an elbow around 4 clusters
wssplot(hotspot, nc=15)
We can see here that there is an elbow around 4, so we run k-means with 4 clusters. With these four clusters (numbered as in the output below), cluster 2 has much higher than average auto thefts and somewhat elevated robberies, but below-average break and enters. Cluster 4 is a safe part of Toronto, with below-average counts for every MCI. Cluster 1 sits near the average, with slightly more break and enters but fewer auto thefts and theft overs. Cluster 3, however, is the most crime-dense, with many more incidents than average for every MCI except auto theft. Luckily, only 4 hotspots fall into cluster 3, while clusters 1 and 4 contain 16 and 10 hotspots respectively and generally saw fewer incidents overall.
From the silhouette plot, we can see that clustering on this dataset works a little better than clustering strategy I. When we plot the clusters in two dimensions using principal component analysis, they are relatively well separated. Clusters 2 and 4 are spread out, while clusters 1 and 3 are tighter. On the map, we can also see that the densest crime cluster is concentrated in downtown Toronto, while the safest cluster lies to the northwest.
km.res <- eclust(hotspot, "kmeans", k = 4, graph = T)
fviz_silhouette(km.res, palette = "jco", ggtheme = theme_classic())
## cluster size ave.sil.width
## 1 1 16 0.22
## 2 2 4 0.22
## 3 3 4 0.24
## 4 4 10 0.46
km.res
## K-means clustering with 4 clusters of sizes 16, 4, 4, 10
##
## Cluster means:
## Assault Auto_Theft Break_and_Enter Robbery Theft_Over
## 1 -0.03088279 -0.2256176 0.3005858 0.02643693 -0.3488020
## 2 0.55292426 2.4255005 -0.5324398 1.12469735 0.5700703
## 3 1.72830722 -0.6195701 1.6301725 1.22286588 2.0263084
## 4 -0.86308013 -0.3613840 -0.9200304 -0.98132438 -0.4804683
##
## Clustering vector:
## [1] 4 1 1 4 1 1 4 3 3 4 4 1 1 1 4 2 2 4 1 1 1 1 1 2 2 3 1 4 3 1 4 1 4 1
##
## Within cluster sum of squares by cluster:
## [1] 22.932432 10.847959 12.848323 6.212482
## (between_SS / total_SS = 68.0 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault" "clust_plot" "silinfo" "nbclust"
## [13] "data"
hotspot$cluster <- factor(km.res$cluster)
ggmap(tormap) +
geom_point(data = torclus, aes(x = Long, y = Lat, color = hotspot$cluster)) +
theme_void() + coord_map()
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
The following code documents our third clustering approach: instead of aggregating the dataset by neighbourhood, we aggregate it by the police division each event belongs to.
#coerce data for Divisions
bygroup <- group_by(data, MCI, Division)
groups <- dplyr::summarise(bygroup, n=n())
groups <- groups[c("Division", "MCI", "n")]
div <- as.data.frame(spread(groups, key=MCI, value=n))
div_name = div[,1]
div <- div[, -1]
#normalize
for(col in names(div)) {
div[,col] = (div[,col] - mean(div[,col])) / sd(div[,col])
}
#determine number of clusters
#we can see there's an elbow around 3 clusters
wssplot(div, nc=15)
# k-means
km.res <- eclust(div, "kmeans", k = 3, graph = T)
fviz_silhouette(km.res, palette = "jco", ggtheme = theme_classic())
## cluster size ave.sil.width
## 1 1 8 0.46
## 2 2 9 0.21
## 3 3 17 0.96
km.res
## K-means clustering with 3 clusters of sizes 8, 9, 17
##
## Cluster means:
## Assault Auto_Theft Break_and_Enter Robbery Theft_Over
## 1 0.4761876 0.1302555 0.5495307 0.3599759 0.5014688
## 2 1.2770657 1.2633142 1.1990178 1.3373206 1.1973892
## 3 -0.9001819 -0.7301101 -0.8933768 -0.8773937 -0.8698972
##
## Clustering vector:
## [1] 1 3 1 3 1 3 2 3 2 3 2 3 2 3 2 3 1 3 2 3 2 3 2 3 2 3 1 3 1 3 1 3 1 3
##
## Within cluster sum of squares by cluster:
## [1] 6.9416751 17.1183040 0.1890451
## (between_SS / total_SS = 85.3 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault" "clust_plot" "silinfo" "nbclust"
## [13] "data"
Similar to the clustering after neighbourhood-based aggregation, one cluster (3) has low crime incidents, one (2) has high crime incidents, and cluster 1 sits in between. Most divisions fall into the lower-crime cluster, while 9 divisions are in the higher-crime one. Judging from both the visual representation of the clustering (based on principal components) and the silhouette plot, however, police-division-based aggregation clearly works better than neighbourhood-based aggregation.
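To make that comparison concrete, the overall average silhouette width can be read off each eclust result. A sketch, assuming the two k=3 k-means results were kept under separate (hypothetical) names rather than overwriting km.res as this report does:
km.hood$silinfo$avg.width #neighbourhood-based; roughly 0.45 given the per-cluster widths above
km.div$silinfo$avg.width  #division-based; roughly 0.64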
So far we have aggregated crime statistics by either neighbourhood or division. What if we run clustering on data that excludes geographical information? Does such clustering produce naturally occurring crime zones? We find out in the following analysis.
library(clustMixType)
data_fullclust = data[,c("occurrencetime","reportedtime","premisetype","offence","MCI")]
data_fullclust$occurrencetime = as.numeric(data_fullclust$occurrencetime)
data_fullclust$reportedtime = as.numeric(data_fullclust$reportedtime)
One important distinction from the previous clustering analyses is that we now have mixed data types: numeric and categorical. To handle this, we first experimented with the daisy() function and Gower's distance. However, Gower's distance drastically increases memory consumption on large datasets, making it impractical to run on our own computers (a sketch of this abandoned approach follows below). We therefore switched to another algorithm, k-prototypes clustering, which is designed for mixed-type data. As with k-means, we determine the number of clusters from the trend in the within-cluster sum of squares.
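For reference, a minimal sketch of the abandoned Gower-based approach, run on a small subsample (the sample size and k are illustrative; on the full ~130k rows, daisy() materializes an n-by-n dissimilarity matrix, which is what exhausted memory):
#Gower dissimilarity on mixed numeric/factor columns, then PAM on that matrix
set.seed(1234)
idx <- sample(nrow(data_fullclust), 2000) #illustrative subsample
gower_dist <- daisy(data_fullclust[idx,], metric = "gower")
pam_fit <- pam(gower_dist, diss = TRUE, k = 4) #k mirrors the k-prototypes choice below
table(pam_fit$clustering)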
library(rlist)
set.seed(1234)
kproto_clusters = list()
for (i in seq(2,8,1)) {
kproto_clusters = list.append(kproto_clusters,kproto(data_fullclust,i,lambda = 1))
}
## # NAs in variables:
## occurrencetime reportedtime premisetype offence MCI
## 0 0 0 0 0
## 0 observation(s) with NAs.
##
## # NAs in variables:
## occurrencetime reportedtime premisetype offence MCI
## 0 0 0 0 0
## 0 observation(s) with NAs.
##
## # NAs in variables:
## occurrencetime reportedtime premisetype offence MCI
## 0 0 0 0 0
## 0 observation(s) with NAs.
##
## # NAs in variables:
## occurrencetime reportedtime premisetype offence MCI
## 0 0 0 0 0
## 0 observation(s) with NAs.
##
## # NAs in variables:
## occurrencetime reportedtime premisetype offence MCI
## 0 0 0 0 0
## 0 observation(s) with NAs.
##
## # NAs in variables:
## occurrencetime reportedtime premisetype offence MCI
## 0 0 0 0 0
## 0 observation(s) with NAs.
##
## # NAs in variables:
## occurrencetime reportedtime premisetype offence MCI
## 0 0 0 0 0
## 0 observation(s) with NAs.
wss = c()
for(kc in kproto_clusters) {
wss = c(wss,kc$tot.withinss)
}
plot(seq(2,8,1),wss, type="b", xlab="Number of Clusters",ylab="Within groups sum of squares")
We see here that beyond 4 clusters (or perhaps 7), there are diminishing returns in reducing the within-cluster sum of squares. We choose 4 clusters in this case, and then plot the clustering results on a map:
kproto_selection = kproto_clusters[[3]]
data$cluster_id = as.factor(kproto_selection$cluster)
to_map <- data.frame(data$MCI, data$Lat, data$Long, data$cluster_id)
colnames(to_map) <- c('crimes', 'lat', 'lon', 'cluster_id')
sbbox <- make_bbox(lon = data$Long, lat = data$Lat, f = 0.05)
my_map <- get_map(location = sbbox, maptype = "roadmap", scale = 2, color="color", zoom = 10)
## Warning: bounding box given to google - spatial extent only approximate.
## converting bounding box to center/zoom specification. (experimental)
## Source : https://maps.googleapis.com/maps/api/staticmap?center=43.717524,-79.379169&zoom=10&size=640x640&scale=2&maptype=roadmap&language=en-EN
ggmap(my_map) +
geom_point(data=to_map, aes(x = lon, y = lat, color = cluster_id),
size = 0.5, alpha = 1) +
xlab('Longitude') +
ylab('Latitude') +
ggtitle('K-prototypes clusters')
From visual inspection of the map alone, it's hard to see any patterns. What if we inspect the statistics within each cluster?
kproto_results <- data %>%
dplyr::select(-X,-occurrencedate,-reporteddate,-occurrencetime,-reportedtime, -Lat, -Long) %>%
group_by(cluster_id) %>%
do(the_summary = summary(.))
print(kproto_results$the_summary)
## [[1]]
## premisetype offence reportedyear
## Apartment : 9282 Assault :15231 2014: 0
## Commercial: 7296 B&E : 7704 2015:20154
## House : 7732 Theft_Of_Motor_Vehicle: 4006 2016:18518
## Other : 3958 Assault_With_Weapon : 2663 2017: 79
## Outside :10483 Robbery_-_Mugging : 1333
## B&E_W'Intent : 1093
## (Other) : 6721
## reportedmonth reportedday reporteddayofyear reporteddayofweek
## 7 : 4931 27 : 1388 179 : 194 1:5427
## 6 : 4872 16 : 1356 226 : 194 2:5728
## 5 : 4801 22 : 1355 165 : 192 3:5560
## 8 : 4613 20 : 1318 178 : 192 4:5456
## 4 : 2905 25 : 1316 180 : 186 5:5497
## 9 : 2541 18 : 1302 240 : 184 6:5676
## (Other):14088 (Other):30716 (Other):37609 7:5407
## reportedhour occurrenceyear occurrencemonth occurrenceday
## 15 : 2100 2015 :20527 7 : 4935 1 : 1676
## 17 : 2056 2016 :18159 6 : 4892 27 : 1336
## 18 : 2036 2014 : 61 5 : 4864 26 : 1318
## 16 : 2011 2013 : 4 8 : 4510 20 : 1316
## 13 : 1993 1966 : 0 4 : 2995 15 : 1312
## 20 : 1959 1968 : 0 10 : 2519 25 : 1312
## (Other):26596 (Other): 0 (Other):14036 (Other):30481
## occurrencedayofyear occurrencedayofweek occurrencehour
## 152 : 199 1:5524 0 : 2465
## 178 : 197 2:5371 12 : 2223
## 121 : 191 3:5264 20 : 2070
## 182 : 188 4:5347 21 : 2052
## 177 : 187 5:5496 22 : 2045
## 150 : 185 6:6031 18 : 2028
## (Other):37604 7:5718 (Other):25868
## MCI Division Hood_ID
## Assault :20545 D41 : 3045 75 : 1336
## Auto_Theft : 4006 D32 : 2884 77 : 1261
## Break_and_Enter: 9042 D51 : 2870 1 : 1006
## Robbery : 3788 D43 : 2818 73 : 819
## Theft_Over : 1370 D14 : 2809 76 : 811
## D31 : 2705 27 : 781
## (Other):21620 (Other):32737
## Neighbourhood cluster_id
## Church-Yonge Corridor (75) : 1336 1:38751
## Waterfront Communities-The Island (77): 1261 2: 0
## West Humber-Clairville (1) : 1006 3: 0
## Moss Park (73) : 819 4: 0
## Bay Street Corridor (76) : 811
## York University Heights (27) : 781
## (Other) :32737
##
## [[2]]
## premisetype offence reportedyear
## Apartment :28 Assault :43 2014:33
## Commercial: 4 Assault_With_Weapon :12 2015:20
## House :31 Theft_Over : 7 2016:13
## Other : 7 Theft_Of_Motor_Vehicle : 4 2017: 8
## Outside : 4 B&E : 2
## Administering_Noxious_Thing: 1
## (Other) : 5
## reportedmonth reportedday reporteddayofyear reporteddayofweek
## 1 :10 6 : 6 34 : 2 1: 5
## 3 : 9 12 : 4 69 : 2 2:15
## 4 : 9 19 : 4 110 : 2 3:18
## 10 : 9 22 : 4 237 : 2 4:13
## 6 : 7 24 : 4 288 : 2 5: 9
## 2 : 6 2 : 3 309 : 2 6:11
## (Other):24 (Other):49 (Other):62 7: 3
## reportedhour occurrenceyear occurrencemonth occurrenceday
## 12 : 9 2000 :12 1 :45 1 :56
## 11 : 8 2004 : 8 10 : 5 13 : 2
## 10 : 6 2002 : 7 4 : 4 20 : 2
## 14 : 6 2003 : 6 9 : 4 23 : 2
## 19 : 6 2001 : 5 3 : 3 24 : 2
## 13 : 5 1996 : 4 6 : 3 30 : 2
## (Other):34 (Other):32 (Other):10 (Other): 8
## occurrencedayofyear occurrencedayofweek occurrencehour
## 1 :44 1: 8 0 :42
## 182 : 2 2:16 12 :12
## 244 : 2 3: 8 8 : 3
## 275 : 2 4: 7 16 : 3
## 31 : 1 5:14 1 : 2
## 32 : 1 6: 9 9 : 2
## (Other):22 7:12 (Other):10
## MCI Division Hood_ID
## Assault :58 D54 : 8 60 : 4
## Auto_Theft : 4 D42 : 7 131 : 4
## Break_and_Enter: 2 D53 : 7 2 : 3
## Robbery : 2 D14 : 6 55 : 3
## Theft_Over : 8 D31 : 5 23 : 2
## D32 : 5 25 : 2
## (Other):36 (Other):56
## Neighbourhood cluster_id
## Rouge (131) : 4 1: 0
## Woodbine-Lumsden (60) : 4 2:74
## Mount Olive-Silverstone-Jamestown (2): 3 3: 0
## Thorncliffe Park (55) : 3 4: 0
## Dorset Park (126) : 2
## Glenfield-Jane Heights (25) : 2
## (Other) :56
##
## [[3]]
## premisetype offence reportedyear
## Apartment :8953 Assault :13461 2014:27854
## Commercial:6689 B&E : 7364 2015: 7921
## House :7359 Theft_Of_Motor_Vehicle: 4134 2016: 61
## Other :3892 Assault_With_Weapon : 2332 2017: 30
## Outside :8973 B&E_W'Intent : 1386
## Robbery_-_Mugging : 1303
## (Other) : 5886
## reportedmonth reportedday reporteddayofyear reporteddayofweek
## 3 : 4335 10 : 1286 1 : 192 1:4975
## 1 : 4222 19 : 1254 100 : 178 2:5237
## 4 : 3883 17 : 1250 93 : 171 3:5208
## 2 : 3774 1 : 1244 97 : 166 4:5170
## 5 : 2563 3 : 1243 41 : 162 5:5142
## 10 : 2533 9 : 1238 49 : 160 6:5176
## (Other):14556 (Other):28351 (Other):34837 7:4958
## reportedhour occurrenceyear occurrencemonth occurrenceday
## 15 : 2038 2014 :27733 1 : 4317 1 : 1939
## 14 : 2019 2015 : 7438 3 : 4261 18 : 1260
## 18 : 1989 2013 : 435 4 : 3764 15 : 1229
## 16 : 1973 2012 : 102 2 : 3760 17 : 1227
## 17 : 1924 2011 : 55 10 : 2563 5 : 1222
## 13 : 1827 2010 : 44 5 : 2523 14 : 1220
## (Other):24096 (Other): 59 (Other):14678 (Other):27769
## occurrencedayofyear occurrencedayofweek occurrencehour
## 1 : 393 1:5009 0 : 2466
## 91 : 197 2:4905 12 : 2102
## 60 : 176 3:5013 18 : 1959
## 31 : 172 4:5024 21 : 1877
## 32 : 169 5:5083 20 : 1808
## 92 : 168 6:5439 17 : 1799
## (Other):34591 7:5393 (Other):23855
## MCI Division Hood_ID
## Assault :17978 D41 : 2909 75 : 1100
## Auto_Theft : 4134 D43 : 2834 77 : 1049
## Break_and_Enter: 8960 D32 : 2725 1 : 931
## Robbery : 3525 D51 : 2531 73 : 690
## Theft_Over : 1269 D14 : 2499 27 : 689
## D31 : 2472 137 : 684
## (Other):19896 (Other):30723
## Neighbourhood cluster_id
## Church-Yonge Corridor (75) : 1100 1: 0
## Waterfront Communities-The Island (77): 1049 2: 0
## West Humber-Clairville (1) : 931 3:35866
## Moss Park (73) : 690 4: 0
## York University Heights (27) : 689
## Woburn (137) : 684
## (Other) :30723
##
## [[4]]
## premisetype offence reportedyear
## Apartment : 9358 Assault :14932 2014: 0
## Commercial: 7786 B&E : 7618 2015: 0
## House : 7406 Theft_Of_Motor_Vehicle: 4244 2016: 9686
## Other : 4539 Assault_With_Weapon : 3235 2017:29531
## Outside :10128 Robbery_-_Mugging : 1281
## B&E_W'Intent : 1052
## (Other) : 6855
## reportedmonth reportedday reporteddayofyear reporteddayofweek
## 10 : 5118 18 : 1404 333 : 204 1:5620
## 11 : 5114 23 : 1353 304 : 201 2:5781
## 9 : 4853 28 : 1342 323 : 201 3:5528
## 12 : 4645 24 : 1339 293 : 195 4:5581
## 8 : 2795 15 : 1335 316 : 190 5:5495
## 7 : 2721 13 : 1333 338 : 190 6:5698
## (Other):13971 (Other):31111 (Other):38036 7:5514
## reportedhour occurrenceyear occurrencemonth occurrenceday
## 15 : 2137 2017 :29211 10 : 5099 1 : 1551
## 16 : 2092 2016 : 9996 11 : 5054 18 : 1395
## 18 : 2052 2015 : 10 9 : 4896 22 : 1355
## 19 : 2050 1966 : 0 12 : 4549 24 : 1354
## 12 : 2007 1968 : 0 8 : 2876 2 : 1329
## 20 : 2006 1970 : 0 7 : 2740 15 : 1325
## (Other):26873 (Other): 0 (Other):14003 (Other):30908
## occurrencedayofyear occurrencedayofweek occurrencehour
## 323 : 200 1:5780 0 : 2358
## 259 : 194 2:5385 12 : 2203
## 293 : 194 3:5281 21 : 2125
## 289 : 193 4:5429 18 : 2122
## 304 : 193 5:5448 22 : 2089
## 328 : 190 6:5909 20 : 2080
## (Other):38053 7:5985 (Other):26240
## MCI Division Hood_ID
## Assault :20665 D32 : 2877 75 : 1544
## Auto_Theft : 4244 D51 : 2761 77 : 1375
## Break_and_Enter: 8900 D14 : 2684 1 : 1010
## Robbery : 3971 D41 : 2674 76 : 888
## Theft_Over : 1437 D31 : 2552 73 : 796
## D43 : 2481 27 : 756
## (Other):23188 (Other):32848
## Neighbourhood cluster_id
## Church-Yonge Corridor (75) : 1544 1: 0
## Waterfront Communities-The Island (77): 1375 2: 0
## West Humber-Clairville (1) : 1010 3: 0
## Bay Street Corridor (76) : 888 4:39217
## Moss Park (73) : 796
## York University Heights (27) : 756
## (Other) :32848
Again, it's not entirely obvious how the clusters differ from one another. However, we do spot something interesting in the 2nd cluster. Previously we observed that people can report crimes that occurred long before the report date; the 2nd cluster seems to isolate exactly these reports (notice the gap between occurrenceyear and reportedyear in this cluster).
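This can be checked directly (a sketch; cluster_id was attached to data above, and the date-time columns were cleaned during preprocessing):
#median lag between occurrence and report, in days, per k-prototypes cluster
data$report_lag_days <- as.numeric(difftime(data$reportedtime, data$occurrencetime, units = "days"))
tapply(data$report_lag_days, data$cluster_id, median)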
Throughout the clustering analysis in part V, we evaluated clustering quality with visual representations of the clusters and with silhouette plots. In short, both police-division-based and neighbourhood-based aggregation produced clusterings with recognizable patterns, with the division-based aggregation yielding the higher-quality clustering. The other two clustering strategies were not as informative, at least not at first glance.
An R Shiny app was created (run server.R to see it). The app should interest anyone eager to explore crime levels and statistics in Toronto, for example people looking to buy or rent property. It can be deployed to interested readers without much concern about scaling, since the collected dataset (i.e., reported crime) is relatively limited in size, unless richer dimensions are introduced to the data collection.
The clusters themselves can be used to direct police to high-crime areas. As new crimes occur, the clusters can be updated to better understand where crime is happening in Toronto and to see whether increased police presence in certain areas helps reduce crime.
As with other data products, some maintenance is expected to keep the data and analysis up to date. The following steps can be taken: (1) Collect data on an ongoing basis;
(2) Collect more features and experiment if additional features could further drive up clustering quality, or prediction accuracy in case that the clustering is used as a preprocessing step for predictive analysis;
(3) Establish a way to evaluate the source of data inconsistencies during data collection activities;
(4) Rerun modeling in a timely manner.
However, as noted above, we don't need to worry much about scaling.
From what we understand of the data, clustering is best used as: (1) a preprocessing step for predictive analysis, reducing the dimensionality of the data and alleviating the "curse of dimensionality"; (2) an exploratory activity to help us understand the data.
For this specific task on Toronto's crime statistics, one can readily recognize clusters based on how many offences happened in each neighbourhood or police division. Simple clustering algorithms such as k-means or hierarchical clustering help with recognizing and visualizing such clusters, and they offer an interesting way to visualize crime in Toronto. Using this historical data, the Toronto Police Service can better allocate their resources to the areas with more crime, whether by neighbourhood, division, or natural hotspot.
Some areas for improvement include normalizing crime incidents for each neighbourhood or division by the area's population: an area with a higher population might have more incidents simply by virtue of having more residents, and it would be interesting to identify areas with high population but low crime. Another improvement would be to look at the demographics of each area. This would let us check whether there is bias in how crime occurrences are reported, and the data could then be adjusted for that bias.
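As a sketch of the per-capita idea (hood_pop here is a hypothetical table with Hood_ID and population columns that would have to be assembled from census data; it is not part of the MCI dataset):
#hypothetical per-capita crime rate per neighbourhood
rates <- merge(total_offence_cnt_table, hood_pop, by = "Hood_ID")
rates$per_1000 <- 1000 * rates$offence_cnt / rates$population
rates[order(-rates$per_1000),][1:5,] #top 5 by rate rather than raw count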